Author: Mannat Sandhu
This use case provides an in-depth analysis of the New York market (for Jumpman23) and offers data-driven insights for targeting 20% growth over the next two months.
What follows is an overview of the salient results of the analysis, before the detailed walkthrough of the code.
There are some data integrity issues present with the data. In particular, I have identified issues with:
- Missing values in the 'when_the_Jumpman_arrived_at_pickup', 'when_the_Jumpman_left_pickup' and 'how_long_it_took_to_order' columns.
- Rows where the recorded timestamps are out of order (e.g., the delivery started after the Jumpman left the pickup).
I have decided to remove this data from the rest of the analysis. This is because I do not know whether the data integrity issues for these fields extend to other fields (i.e., whether the fields are populated independently of each other).
Background:
Inferences:
All of this is explained in detail with the accompanying code below.
import pandas as pd
df = pd.read_csv('analyze_me.csv')
df.head()
# Import packages to be used later on.
# Adding it here so they can be installed in the beginning on the machine if needed.
import numpy as np
import folium
from folium import plugins
import plotly.offline as py
import plotly.graph_objs as go
import cufflinks as cf
import plotly.figure_factory as ff
cf.go_offline()
py.init_notebook_mode(connected=True)
I have installed them on my machine using the following commands:
conda install -c conda-forge folium  # or: pip install folium
pip install plotly
pip install cufflinks
# Add some functions to draw maps, which will be used throughout the analysis.
from folium.plugins import MarkerCluster
def GetMap(dataframe, latitude, longitude):
    # Center the map on the mean of the given coordinate columns.
    mean_coordinates = [dataframe.loc[:, latitude].mean(), dataframe.loc[:, longitude].mean()]
    folium_map = folium.Map(mean_coordinates, zoom_start=11)
    return folium_map
def GetMapWithMarkers(folium_map, coordinates, icon='cutlery'):
    # Cluster nearby markers so that dense areas remain readable.
    marker_cluster = MarkerCluster().add_to(folium_map)
    for point in coordinates:
        folium.Marker(point,
                      icon=folium.Icon(color='darkblue', icon_color='white',
                                       icon=icon, angle=0, prefix='fa')).add_to(marker_cluster)
    return folium_map
def GetMapWithMarkersFromDataFrame(dataframe, latitude, longitude, icon='cutlery'):
    # Build a clustered marker map directly from a DataFrame's coordinate columns.
    folium_map = GetMap(dataframe, latitude, longitude)
    coordinates = dataframe[[latitude, longitude]].to_numpy()
    folium_map = GetMapWithMarkers(folium_map, coordinates, icon)
    return folium_map
def GetHeatMapFromDataFrame(dataframe, latitude, longitude):
    # Overlay a heat map of the given coordinate columns.
    folium_map = GetMap(dataframe, latitude, longitude)
    coordinates = dataframe[[latitude, longitude]].to_numpy()
    folium_map.add_child(plugins.HeatMap(coordinates, radius=15))
    return folium_map
df.info()
# Convert datetime and timedelta columns to correct format for easy analysis.
date_time_columns = ['when_the_delivery_started','when_the_Jumpman_arrived_at_pickup',
'when_the_Jumpman_left_pickup', 'when_the_Jumpman_arrived_at_dropoff']
df[date_time_columns] = df[date_time_columns].apply(pd.to_datetime)
time_delta_columns = ['how_long_it_took_to_order']
df[time_delta_columns] = df[time_delta_columns].apply(pd.to_timedelta)
date_time_columns.extend(time_delta_columns)
date_time_columns
# To begin, check how many rows do not have these populated:
df[date_time_columns].isnull().sum()
There are 550 rows each where 'when_the_Jumpman_arrived_at_pickup' and 'when_the_Jumpman_left_pickup' are not populated.
Moreover, there are 2945 rows where 'how_long_it_took_to_order' is not populated.
I will call this subset problematic_data_null_datetime and explore it for any patterns in the anomalies.
problematic_data_null_datetime = df[df[date_time_columns].isnull().any(axis=1)]
clean_df = df[~df[date_time_columns].isnull().any(axis=1)].copy()  # .copy() so adding columns later does not warn or modify df
Let me check whether the rows with null values come from a specific location.
# Check if the problematic data is from a particular pickup location.
GetHeatMapFromDataFrame(problematic_data_null_datetime, 'pickup_lat', 'pickup_lon')
# Check if the problematic data is from a particular dropoff location.
GetHeatMapFromDataFrame(problematic_data_null_datetime, 'dropoff_lat', 'dropoff_lon')
The problematic data does not appear to come from any particular location. Its heat map is roughly in line with the overall heat map - no anomalies stand out.
Let's dig a bit deeper:
def GetPercentageOfUniqueDataInSample(sample_df, complete_df, column_name):
    return sample_df[column_name].unique().shape[0] / complete_df[column_name].unique().shape[0] * 100
print("Percentage of unique jumpmen in the null populated data: " +
str(GetPercentageOfUniqueDataInSample(problematic_data_null_datetime, df, 'jumpman_id')))
print("Percentage of unique customers in the null populated data: " +
str(GetPercentageOfUniqueDataInSample(problematic_data_null_datetime, df, 'customer_id')))
print("Percentage of unique cehicles in the null populated data: " +
str(GetPercentageOfUniqueDataInSample(problematic_data_null_datetime, df, 'vehicle_type')))
print("Percentage of unique place categories in the null populated data: " +
str(GetPercentageOfUniqueDataInSample(problematic_data_null_datetime, df, 'place_category')))
So, this is a fairly widespread problem: it occurs for all vehicle types, and for a large proportion of jumpmen, customers and places.
Most likely, it cannot be traced to a particular group based on location or any of these factors.
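To back this up, here is a minimal sketch (my own addition) that computes the null rate per vehicle type; any vehicle type with a much higher rate would stand out:
# Fraction of rows with a missing pickup-arrival time, per vehicle type.
# Uses the full df, before any filtering.
null_rate_by_vehicle = (df['when_the_Jumpman_arrived_at_pickup']
                        .isnull()
                        .groupby(df['vehicle_type'])
                        .mean())
print(null_rate_by_vehicle)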
We want to make sure that the datetime and timedelta columns behave correctly. One way to check is to verify that the timestamps occur in the expected sequence:
when_the_delivery_started <= when_the_Jumpman_arrived_at_pickup <= when_the_Jumpman_left_pickup <= when_the_Jumpman_arrived_at_dropoff
Finally, we also want how_long_it_took_to_order > 0.
print("Case 1 (when_the_delivery_started > when_the_Jumpman_arrived_at_pickup): " +
str(clean_df[clean_df.when_the_delivery_started > clean_df.when_the_Jumpman_arrived_at_pickup].shape[0]))
print("Case 2 (when_the_delivery_started > when_the_Jumpman_left_pickup): " +
str(clean_df[clean_df.when_the_delivery_started > clean_df.when_the_Jumpman_left_pickup].shape[0]))
print("Case 3 (when_the_delivery_started > when_the_Jumpman_arrived_at_dropoff): " +
str(clean_df[clean_df.when_the_delivery_started > clean_df.when_the_Jumpman_arrived_at_dropoff].shape[0]))
print("Case 4 (when_the_Jumpman_arrived_at_pickup > when_the_Jumpman_left_pickup): " +
str(clean_df[clean_df.when_the_Jumpman_arrived_at_pickup > clean_df.when_the_Jumpman_left_pickup].shape[0]))
print("Case 5 (when_the_Jumpman_arrived_at_pickup > when_the_Jumpman_arrived_at_dropoff): " +
str(clean_df[clean_df.when_the_Jumpman_arrived_at_pickup > clean_df.when_the_Jumpman_arrived_at_dropoff].shape[0]))
print("Case 6 (when_the_Jumpman_left_pickup > when_the_Jumpman_arrived_at_dropoff): " +
str(clean_df[clean_df.when_the_Jumpman_left_pickup > clean_df.when_the_Jumpman_arrived_at_dropoff].shape[0]))
# Remove the rows where the delivery started after the Jumpman left the pickup (Case 2).
clean_df = clean_df[~(clean_df.when_the_delivery_started > clean_df.when_the_Jumpman_left_pickup)].copy()
clean_df.shape[0]
Only 2811 entries remain as part of the clean data. Out of these, I will check for data issues with one more condition: how_long_it_took_to_order > 0.
clean_df.info()
clean_df['seconds_it_took_to_order'] = clean_df['how_long_it_took_to_order']/np.timedelta64(1, 's')
print("The condition how_long_it_took_to_order > 0 satisfied for all rows: "
+ str(((clean_df['seconds_it_took_to_order'] > 0).all())))
NOTE: For the rest of the analysis, I will only use the clean data. This is because I do not know whether the data integrity issues in the fields I have checked are also present in the fields I have not checked specifically.
In other words, it is quite possible that the underlying problem causing the data integrity issues in the date/time fields also affects other fields, such as pickup locations, items, etc.
df = clean_df
# Add the complete delivery time, in whole minutes, from when the delivery started to when the Jumpman arrived at dropoff.
df['complete_delivery_time'] = (df['when_the_Jumpman_arrived_at_dropoff'] - df['when_the_delivery_started']).dt.total_seconds() // 60
In this section, I will explore the following KPIs:
- Average deliveries per hour of the day
- Average deliveries per day of the week
- Top merchants by delivery frequency
- Pickup and dropoff locations
To begin with, I will preprocess the data to make it suitable for analysing these KPIs.
def GetPercentageUnique(dataframe, column_name):
    return dataframe[column_name].unique().shape[0] / dataframe.shape[0] * 100
print("Percentage of unique delivery_id: " + str(GetPercentageUnique(df, 'delivery_id')))
The data is not structured so that every row represents one delivery: looking at the data, an order with multiple items has one row per item for that delivery.
So, let us create a new DataFrame with one row per delivery.
delivery_info_columns = ['delivery_id', 'customer_id', 'jumpman_id',
'when_the_delivery_started', 'pickup_place', 'pickup_lat',
'pickup_lon', 'dropoff_lat', 'dropoff_lon', 'vehicle_type',
'complete_delivery_time']
delivery_info_df = df[delivery_info_columns]
delivery_info_df = delivery_info_df.drop_duplicates()
# Verify that the data is indeed deduplicated.
print("Percentage of unique delivery_id in dedeuplicated data: " +
str(GetPercentageUnique(delivery_info_df, 'delivery_id')))
delivery_info_df.info()
We can see that there are no null rows for these columns. Hence, we can analyze them without worrying about missing values affecting the analysis.
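As a defensive measure (a small addition of mine, not strictly required), we can assert this before proceeding:
# Fail fast if any of the delivery-level columns contain missing values.
assert delivery_info_df[delivery_info_columns].notnull().all().all()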
# Data preprocessing - derive date/time features to make the later analysis easier.
delivery_info_df['date'] = delivery_info_df['when_the_delivery_started'].dt.date
delivery_info_df['day'] = delivery_info_df['when_the_delivery_started'].dt.day
delivery_info_df['hour'] = delivery_info_df['when_the_delivery_started'].dt.hour
delivery_info_df['dayofweek'] = delivery_info_df['when_the_delivery_started'].dt.dayofweek
delivery_info_df['week'] = delivery_info_df['when_the_delivery_started'].dt.isocalendar().week  # .dt.week is deprecated in recent pandas
average_deliveries_all_month_per_hour = delivery_info_df.groupby(['day','hour'])['delivery_id'].count()
average_deliveries_per_hour = average_deliveries_all_month_per_hour.groupby('hour').mean()
# Make sure that all hours are indeed present in average_deliveries_per_hour.
for index in range(24):
if index not in average_deliveries_per_hour:
# Set the value for averaged deliveries for that hour as 0.
average_deliveries_per_hour[index] = 0
# Sort w.r.t. index.
average_deliveries_per_hour = average_deliveries_per_hour.sort_index()
# Plot on bar graph.
import plotly.graph_objects as go
fig = go.Figure(
data=[go.Bar(y=average_deliveries_per_hour, x=['12am', '1am', '2am', '3am', '4am',
'5am', '6am', '7am', '8am', '9am',
'10am', '11am', '12pm', '1pm', '2pm',
'3pm', '4pm', '5pm', '6pm', '7pm',
'8pm', '9pm', '10pm', '11pm'])],
layout_title_text="Average number of deliveries per time of day"
)
fig
As can be seen from the bar graph, deliveries peak at around 12pm and around 7pm.
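As a sanity check on the bar graph, the peak hours can also be read off programmatically; a minimal sketch:
# The five hours with the highest average number of deliveries.
print(average_deliveries_per_hour.sort_values(ascending=False).head(5))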
deliveries_all_month_per_week = delivery_info_df.groupby(['week','dayofweek'])['delivery_id'].count()
average_deliveries_per_week = deliveries_all_month_per_week.groupby('dayofweek').mean()
# Make sure that all days of the week are indeed present in average_deliveries_per_week.
for index in range(7):
if index not in average_deliveries_per_week:
# Set the value for averaged deliveries for that day of the week as 0.
average_deliveries_per_week[index] = 0
# Sort w.r.t. index.
average_deliveries_per_week = average_deliveries_per_week.sort_index()
# Plot on bar graph.
import plotly.graph_objects as go
fig = go.Figure(
    # pandas' dt.dayofweek uses 0 = Monday, so the labels must start at 'Mon'.
    data=[go.Bar(y=average_deliveries_per_week, x=['Mon', 'Tues', 'Wed', 'Thurs',
                                                   'Fri', 'Sat', 'Sun'])],
layout_title_text="Average deliveries per day of the week"
)
fig
Most days of the week have a similar number of deliveries, but there is an increase during Thursday, Friday and Saturday (increasing in that order).
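The ranking can be confirmed numerically; a small sketch that maps pandas' dayofweek convention (0 = Monday) to names:
# Rank days of the week by average deliveries.
day_names = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
print(average_deliveries_per_week.rename(lambda d: day_names[d]).sort_values(ascending=False))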
merchant_delivery_count = delivery_info_df
merchant_delivery_count = merchant_delivery_count.groupby(['pickup_place']).size().reset_index(name='counts')
merchant_delivery_count = merchant_delivery_count.sort_values(by=['counts'], ascending=False)
merchant_delivery_count = merchant_delivery_count.head(20)
merchant_delivery_count = merchant_delivery_count.sort_values(by=['counts'], ascending=True)
merchant_delivery_count.iplot(x = 'pickup_place', y ='counts', kind='bar',
title='Top merchant delivery frequency in Oct 2014',
yTitle='Merchant name',
xTitle='Total deliveries', orientation='h')
# Visualize pickups for each delivery on the map:
map_pickup_by_delivery = GetMapWithMarkersFromDataFrame(delivery_info_df, 'pickup_lat', 'pickup_lon')
map_pickup_by_delivery
map_dropoff_by_delivery = GetMapWithMarkersFromDataFrame(delivery_info_df, 'dropoff_lat', 'dropoff_lon', 'home')
map_dropoff_by_delivery
pickup_heat_map = GetHeatMapFromDataFrame(delivery_info_df, 'pickup_lat', 'pickup_lon')
pickup_heat_map
dropoff_heat_map = GetHeatMapFromDataFrame(delivery_info_df, 'dropoff_lat', 'dropoff_lon')
dropoff_heat_map
Almost all pickups and dropoffs happen in Manhattan, with a few in parts of Brooklyn. The primary area for both pickups and dropoffs is lower Manhattan.
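To quantify how concentrated the deliveries are, a quick look at the coordinate ranges (a minimal sketch):
# Summary statistics for pickup and dropoff coordinates.
print(delivery_info_df[['pickup_lat', 'pickup_lon', 'dropoff_lat', 'dropoff_lon']].describe())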
In this section, I will explore the following KPIs:
- Deliveries per day over the month
- Customer order frequency
I will also look at one-time customers as a potential growth prospect.
deliveries_per_day = delivery_info_df.groupby(['date'])['delivery_id'].count()
# Visualize using a line plot - easier to see trends that way.
deliveries_per_day.iplot(kind='line', title='Deliveries for October 2014', yTitle="Number of deliveries")
Although it might be possible to try to read trends off a line chart, it is probably not a good idea to draw conclusions from it for our use case.
We want to make a plan for the market to grow over the next two months, but the dataset we are analysing covers only one month. Using a (small) one-month sample to analyse market trends (such as growth in the number of deliveries, customer increase, etc.) and then assuming those conclusions generalise to a period twice as long as the sample is not good practice.
So, I will focus on customer retention, as shown below.
delivery_counts_per_customer = delivery_info_df.groupby(['customer_id']).size().reset_index(name='counts')
customers_per_count = delivery_counts_per_customer.groupby(['counts']).size().reset_index(name='num_customers')
customers_per_count.head(10).iplot(x = 'counts', y ='num_customers', kind='bar',
title='Customer order frequency in Oct 2014',
xTitle='Total orders',
yTitle='Number of customers')
A lot of customers ordered only once in this data. This is worth noting, and I will come back to it throughout the rest of the market analysis.
num_customers_who_ordered_once = customers_per_count[customers_per_count['counts'] == 1].iloc[0]['num_customers']
total_customers = customers_per_count.loc[:,'num_customers'].sum()
percentage_of_customers_who_ordered_once = num_customers_who_ordered_once/total_customers * 100
print("Percentage of customers who ordered only once: " + str(percentage_of_customers_who_ordered_once))
The problem statement is to come up with a plan to grow the market by 20% in two months. Given that roughly 70% of customers ordered only once, this seems like a very promising avenue.
Note that "the market" may mean different things: the number of customers, the number of deliveries, or the total value of orders.
For the purpose of this analysis, I will take it to mean the most relevant one: growth in the number of deliveries.
So, I will try to figure out how many of the deliveries are by one-time customers:
num_deliveries_by_one_time_customers = num_customers_who_ordered_once
num_total_deliveries = delivery_info_df.shape[0]
# Calculate total deliveries as (frequency * number of customers).
# customers_per_count['num_total_deliveries'] = customers_per_count['counts'] * customers_per_count['num_customers']
# num_total_deliveries = customers_per_count.loc[:,'num_total_deliveries'].sum()
percentage_of_deliveries_by_one_time_customers = num_deliveries_by_one_time_customers/num_total_deliveries * 100
print("Percentage of deliveries by one time customers: " + str(percentage_of_deliveries_by_one_time_customers))
So, even if we can get half of the one-time customers to order just once more, the market will grow by ~21%, exceeding the 20% target.
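The arithmetic behind the ~21% figure, as a short sketch using the quantities computed above:
# Additional deliveries if half of the one-time customers order exactly once more,
# expressed as a percentage of current deliveries (i.e., half of the percentage
# of deliveries made by one-time customers).
projected_growth = (0.5 * num_deliveries_by_one_time_customers) / num_total_deliveries * 100
print("Projected growth from retaining half of the one-time customers: " + str(projected_growth) + "%")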
Ideally, a good way to do that is to figure out why these customers are not ordering again. It might be that they simply had no need to order more than once, or they may not have ordered again because they had a poor delivery experience (for example, a long delivery time or an inexperienced Jumpman).
It would be good to survey these customers and analyse the responses. For the purpose of this analysis, I will check what can already be inferred from the data at hand.
df_one_time_customers = delivery_counts_per_customer[delivery_counts_per_customer['counts'] == 1][['customer_id']]
list_of_one_time_customers = list(df_one_time_customers.customer_id.values)
delivery_info_df_one_time_customers = delivery_info_df.loc[delivery_info_df['customer_id'].isin(list_of_one_time_customers)]
In this section, I will explore the following KPIs:
- Delivery frequency per Jumpman
- Vehicle types used for deliveries
- Delivery times
I will also analyse each of them for potential anomalies with respect to one-time customers, to explore growth prospects.
delivery_counts_per_jumpman = delivery_info_df.groupby(['jumpman_id']).size().reset_index(name='counts')
jumpmen_per_count = delivery_counts_per_jumpman.groupby(['counts']).size().reset_index(name='num_jumpmen')
jumpmen_per_count.iplot(x = 'counts', y ='num_jumpmen', kind='bar',
title='Jumpmen Delivery frequency in Oct 2014',
xTitle='Total Deliveries',
yTitle='Number of Jumpmen')
A lot of Jumpmen made only one delivery in this data, although the skew is not as stark as it is for customers.
One possibility is that less experienced Jumpmen tend to deliver to the customers with a one-time purchase. If that is indeed the case, it may be something we want to address.
Let me plot this information.
# Get the list of Jumpmen who delivered to one-time customers.
jumpmen_who_delivered_to_one_time_customers = list(delivery_info_df_one_time_customers.jumpman_id.values)
# Let us call these Jumpmen jumpmen_under_consideration.
delivery_info_df_jumpmen_under_consideration = delivery_info_df.loc[delivery_info_df['jumpman_id'].isin(jumpmen_who_delivered_to_one_time_customers)]
delivery_counts_per_jumpmen_under_consideration = delivery_info_df_jumpmen_under_consideration.groupby(['jumpman_id']).size().reset_index(name='counts')
jumpmen_under_consideration_per_count = delivery_counts_per_jumpmen_under_consideration.groupby(['counts']).size().reset_index(name='num_jumpmen')
jumpmen_under_consideration_per_count.iplot(x = 'counts', y ='num_jumpmen', kind='bar',
title='Jumpmen Delivery frequency in Oct 2014 for Jumpmen who delivered to one time customers',
xTitle='Total Deliveries',
yTitle='Number of Jumpmen')
There is a slight skew towards less experienced Jumpmen, but the difference is not large enough to warrant further digging.
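To make the "slight skew" concrete, a minimal sketch comparing the two distributions by their medians:
# Median deliveries per Jumpman, overall vs. Jumpmen who served one-time customers.
print("Median deliveries per Jumpman (all): " +
      str(delivery_counts_per_jumpman['counts'].median()))
print("Median deliveries per Jumpman (served one-time customers): " +
      str(delivery_counts_per_jumpmen_under_consideration['counts'].median()))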
vehicle_type = delivery_info_df['vehicle_type'].value_counts(normalize=False)
import plotly.express as px
fig = px.pie(vehicle_type, values=vehicle_type, names=vehicle_type.index)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()
I will analyse the same information for one-time customers, to make sure there are no anomalies.
vehicle_type_one_time_customers = delivery_info_df_one_time_customers['vehicle_type'].value_counts(normalize=False)
fig = px.pie(vehicle_type_one_time_customers, values=vehicle_type_one_time_customers, names=vehicle_type_one_time_customers.index)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()
Again, the statistics are roughly the same as for repeat customers.
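For a numeric comparison of the two pie charts, the normalized vehicle-type shares can be placed side by side; a sketch:
# Vehicle-type share (as fractions) for all customers vs. one-time customers.
share_comparison = pd.concat(
    [delivery_info_df['vehicle_type'].value_counts(normalize=True).rename('all_customers'),
     delivery_info_df_one_time_customers['vehicle_type'].value_counts(normalize=True).rename('one_time_customers')],
    axis=1)
print(share_comparison)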
def RoundDown(num, divisor):
    # Round down to the nearest multiple of divisor, e.g. RoundDown(47, 10) == 40.
    return num - (num % divisor)
delivery_times_df = delivery_info_df.copy()  # work on a copy so delivery_info_df is not modified
delivery_times_df['delivery_time_rounded'] = RoundDown(delivery_times_df['complete_delivery_time'], 10)
delivery_time_counts = delivery_times_df.groupby(['delivery_time_rounded']).size().reset_index(name='counts')
delivery_time_counts.iplot(x = 'delivery_time_rounded', y ='counts', kind='bar',
title='Delivery Time frequency in Oct 2014',
yTitle='Number of deliveries',
xTitle='Delivery Time in Minutes')
I will analyse the same information for one-time customers, to make sure there are no anomalies.
delivery_times_df_one_time_customers = delivery_info_df_one_time_customers.copy()  # again, work on a copy
delivery_times_df_one_time_customers['delivery_time_rounded'] = RoundDown(delivery_times_df_one_time_customers['complete_delivery_time'], 10)
delivery_time_counts_one_time_customers = delivery_times_df_one_time_customers.groupby(['delivery_time_rounded']).size().reset_index(name='counts')
delivery_time_counts_one_time_customers.iplot(x = 'delivery_time_rounded', y ='counts', kind='bar',
title='Delivery Time frequency in Oct 2014 for one time customers',
yTitle='Number of deliveries',
xTitle='Delivery Time in Minutes')
As before, the statistics are roughly the same as for repeat customers.
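The visual comparison can be double-checked with summary statistics; a minimal sketch:
# Compare delivery-time distributions for all customers vs. one-time customers.
print("Median delivery time in minutes (all): " +
      str(delivery_info_df['complete_delivery_time'].median()))
print("Median delivery time in minutes (one-time customers): " +
      str(delivery_info_df_one_time_customers['complete_delivery_time'].median()))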
In this section, I will explore the goods being delivered, in particular the top merchant place categories by delivery frequency.
goods_info_columns = ['delivery_id', 'customer_id', 'jumpman_id',
'when_the_delivery_started', 'pickup_place', 'place_category',
'pickup_lat', 'pickup_lon', 'dropoff_lat', 'dropoff_lon', 'vehicle_type']
goods_info_df = df[goods_info_columns]
goods_info_df = goods_info_df.drop_duplicates()
goods_info_df.info()
We can see that there are null rows for these columns (place_category contains missing values). Hence, we cannot analyze them without worrying about missing values affecting the analysis.
For our purpose, let us drop the null rows altogether and restrict the exploratory analysis to the non-null rows.
goods_info_df = goods_info_df.dropna()
goods_info_df.info()
merchant_delivery_count = goods_info_df.groupby(['place_category']).size().reset_index(name='counts')
merchant_delivery_count = merchant_delivery_count.sort_values(by=['counts'], ascending=False)
merchant_delivery_count = merchant_delivery_count.head(20)
merchant_delivery_count = merchant_delivery_count.sort_values(by=['counts'], ascending=True)
merchant_delivery_count.iplot(x = 'place_category', y ='counts', kind='bar',
title='Top merchant place categories by delivery frequency in Oct 2014',
yTitle='Merchant place category',
xTitle='Total deliveries', orientation='h')
This might be slightly biased, as we have removed a lot of rows. However, assuming that place_category is left unpopulated without bias towards a certain kind of category, i.e., more or less at random, the bar graph can reasonably be used to estimate trends, if not exact values.
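As a partial check on that assumption, we can at least measure how much data was dropped; a sketch using the item-level df:
# Fraction of rows with a missing place_category (before dropna).
missing_fraction = df['place_category'].isnull().mean()
print("Fraction of rows with a missing place_category: " + str(missing_fraction))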